Reduce Operator Tuning

It's easy to implement in CUDA, but hard to get it right.

At its core, a reduction simply computes:

$$x = x_0 \otimes x_1 \otimes \cdots \otimes x_n$$